Artificial intelligence detects awareness of functional relation with the environment in 3 month old babies

A recent experiment probed how purposeful action emerges in early life by manipulating infants’ functional connection to an object in the environment (i.e., tethering an infant’s foot to a colorful mobile). Vicon motion capture data from multiple infant joints were used here to create Histograms of Joint Displacements (HJDs) to generate pose-based descriptors for 3D infant spatial trajectories. Using HJDs as inputs, machine and deep learning systems were tasked with classifying the experimental state from which snippets of movement data were sampled. The architectures tested included k-Nearest Neighbour (kNN), Linear Discriminant Analysis (LDA), Fully connected network (FCNet), 1D-Convolutional Neural Network (1D-Conv), 1D-Capsule Network (1D-CapsNet), 2D-Conv and 2D-CapsNet. Sliding window scenarios were used for temporal analysis to search for topological changes in infant movement related to functional context. kNN and LDA achieved higher classification accuracy with single joint features, while deep learning approaches, particularly 2D-CapsNet, achieved higher accuracy on full-body features. For each AI architecture tested, measures of foot activity displayed the most distinct and coherent pattern alterations across different experimental stages (reflected in the highest classification accuracy rate), indicating that interaction with the world impacts the infant behaviour most at the site of organism~world connection.

Artificial neural networks (ANNs) were first developed to understand biological behaviour and mechanisms of cognition 1 .Turing believed that a crucial measure of artificial intelligence (AI) is the ability to mimic the way in which complex behaviour becomes organized in infants 2 .Given recent technical advancements in computing and AI as well as theoretical advancements in infant learning, it may be possible to use machine and deep learning techniques to study how infants transition from loosely structured exploratory movement to more organized, intentional action.Thus far, such methods have focused on analyzing spontaneous movements 3 and distinguishing fidgety from non-fidgety movements 4 .
Though early infant movement is chaotic, meaningful patterns emerge as infants adapt to external perturbations and constraints through dynamic interaction between brain, body and environment 5,6 .However, little is known about the mechanisms by which infants begin to intentionally act on functional relationships with their environment.Laws governing bidirectional interaction between infant and environment are lacking, and the roots of conscious, coordinated, goal-directed action remain largely unexplored 6,7 .
A paradigm designed a half century ago to study infant memory and learning provides an experimental window into the formation of human agency, action towards an end 6 .In this so-called mobile conjugate reinforcement (MCR) paradigm, Rovee et al. connected a ribbon between an infant's ankle and a mobile suspended over the infant's crib 8,9 .Conjugate reinforcement refers to the sights,noises, and sensations due to mobile movement all being ostensibly dependent on and in proportion to the magnitude and rate of infant action.In short, the thinking was that the more the infant moved, the more 'reward' the mobile provided, stimulating further infant movement.Infants moved the connected leg at much higher rates compared to baseline, which Rovee and Rovee 8 interpreted as reinforcement learning.However, mounting evidence suggests that rather than being skeleton definition to translate and rotate the virtual camera's viewpoint to generate a new coordinate system.This produced a robust and adaptive neural network for automatic, optimal spatiotemporal representation.In many respects, action recognition is more straightforward using skeletal joint information than RGB video imagery, and hence is preferred here.
The data used in the present work were obtained from a baby~mobile experiment which quantitatively identified moments of agentive discovery and explored underlying mechanisms through analysis of infant motion, mobile motion and their coordination dynamics 20,48 .Movement data were collected using a Vicon 3D MoCap system.Given the nature of this experiment and the fact that the MoCap data provide exact infant joint locations, several machine and deep learning approaches for classifying pose-based features are proposed and evaluated here.The first main objective is to classify infant movement across different experimental stages, ranging from spontaneous activity to reactions to an externally driven moving stimulus (here a mobile) to make the mobile move to losing control over the mobile.Successful classification would indicate that the functional context and infant sensitivity to context drive changes in the structural features of infant movement.The classification accuracy of different experimental stages will highlight the pros and cons of applying different machine and deep learning approaches in infant studies and, at the same time, expose whether behaviour is more constrained and characteristic within certain contexts, leading to higher accuracy classification.The second main aim is to study temporal features of infant behaviour using sliding windows.To reiterate, the infant's discovery that 'I can make the mobile move' emerges from a coordinative dance between organism and environment which unfolds in time 20 .Assessing classification accuracy using sliding windows will provide new insight to the dynamic topological evolution in infants exploring and discovering their relationship to the world.With our unique dataset and in the context of related evidence of infant agentive discovery, we examine and optimize machine and deep learning methods to characterize the flow of structural change in infant movement reflective of cognitive processes and behavioural adaptation within and across coordinative contexts.We demonstrate that approaches based on deep learning are well-suited for working with pose-based data.In particular, we show that CapsNet-based approaches, such as 1D-Capsule Network (1D-CapsNet) and 2D-CapsNet preserve the hierarchy of features and avoid information loss in the model's architecture by substituting pooling with dynamic routing especially when fused features were employed.More to the point, we demonstrate that AI systems provide significant insight into the early ability of infants to actively detect and engage in a functional relationship with the environment.

Experiment
Sixteen reflective markers were placed on 3-4-month-old babies (N = 16) to track infant movement.All experimental protocols were approved by Florida Atlantic University (FAU) and Florida Department of Health (DOH) Internal Review Boards.This experiment was performed in accordance with FAU and Florida DOH guidelines and regulations.Informed consent was obtained from a parent and/or legal guardian of infant for both study participation and publication of identifying information/images in an online open-access publication.The marker arrangement can be seen in Fig. 2a.In each trial, the infant was placed in a crib face-up, with a mobile suspended above.The mobile consisted of two colourful blocks attached to the ends of a wooden arm.Seven infrared cameras surrounding the crib tracked each marker's position at a rate of 100 Hz using a Vicon MoCap system.Three video cameras placed at different angles were also used to record each session.The experiment consisted of four experimental stages.
In the spontaneous baseline (2 min.long) the mobile was stationary.In the uncoupled reactive baseline (2 min.), the experimenter triggered mobile movement independent of infant movement.Two strings were connected to the sock of one foot (the trigger foot, visible in Fig. 2b), commencing the tethered phase (5-6 min.).The side of the body selected as the trigger foot was randomized.When the strings connected to the trigger foot were pulled, the mobile rotated.Infant trigger foot movement was digitally translated into rate of mobile rotation.The strings were then detached, leaving the mobile stationary once again constituting an untethered phase (2 min.).These four stages respectively measure infant activity (1) when there was no mobile motion and infant activity was spontaneously generated, (2) when mobile motion was an externally driven stimulus that was not affected by infant motion, probing the infant's reaction to the raw stimulation a moving mobile provides, (3) when mobile motion was a direct result of infant motion, and (4) when infant motion no longer resulted in mobile motion (Fig. 1).An illustration of the extracted 3D infant skeleton is shown in Fig. 2c.

Pipeline overview
An overview of the feature extraction and classification is shown in Fig. 3.As will be described in the following section, we first performed preprocessing to ensure the accuracy of the timeseries for the keypoints¸ the set of 3D movement markers extracted by the MoCap system.Important information was then extracted from preprocessed sequences categorized into distinct experimental stages.In the next step, we used a spherical coordinate system to segment the 3D joints into an n-bin histogram.A histogram-based approach was employed to create pose-based feature sets called Joint Displacements (JDs) to feed as inputs to the networks for experimental stage classification.The following sections of the paper explain each of these steps in detail.Finally, the machine and deep learning approaches tested here are described and justified.

Dataset description and preprocessing
The input data for baby movement pattern recognition are a sequence of 3D skeletal joint positions (keypoints) collected during the experiment using a marker-based MoCap system collecting at 100 Hz.For the current analysis, five infants were selected from the larger pool of participants because these infants had full datasets for all necessary markers throughout the experiment.Due to challenges inherent in motion capture with infants, such as their small size and rapid, unpredictable, and sometimes contorted movements, some markers were occasionally obscured or not captured effectively by the Vicon system.Additionally, infant compliance is not guaranteed for the experiment's entirety.An infant was excluded if a single marker was missing for the duration of any phase of the experiment.Consequently, only data from five infants met the stringent criteria for data completeness necessary for our analysis.The Vicon MoCap system used to collect these data provides a multiview video dataset as well as a three-dimensional keypoint locations for baby movements that can compensate for RGB images' disadvantages 49,50 .To visually introduce the data, using frames of video the position of the baby's limbs changed constantly throughout the experiment (Fig. 4). Figure 5 presents a single frame from each of the four stages.Figure 5a, b, and d are examples where the ribbon is not connected to the trigger foot.In (a) and (d), the mobile does not move.In (b), the mobile moves when triggered by the experimenter.The baby can only activate the mobile once the ribbon has been attached to the foot visible in (c).To clarify, however, only 3D keypoint data (not video) was used for AI training and analysis.www.nature.com/scientificreports/ In every frame of data, the keypoints are labelled based on the anatomical locations of the movement markers in the frame.These keypoints are as follows: Head, C Pelvis, L/R Pelvis, L/R Shoulder, L/R Hand, L/R Hip, L/R Thigh, L/R Knee, L/R Ankle, L/R Foot (see Fig. 6).However, a subset of these skeletal joints were included in the current study since complete data were available for only some of the markers.For the current project, we used marker data for 12 keypoints: Head, Center Pelvis, L/R Hip, L/R Shoulder, L/R Hand, L/R Knee, and L/R Foot.As a result, the dataset contains 12 3D sequences.
Various hurdles to 3D motion capture exist with infants such as marker jitters either due to system errors, extraneous reflections, or the mode of affixing markers to the infant's body.For example, socks are made from stretchable fabric, allowing markers attached to the sock to wobble more than markers which are taped directly to the skin (a more rigid connection, that babies don't particularly like).Other challenges include insufficient coverage of the infant's limb motion either due to occlusion of the markers by vicissitudes of the experimental setup (i.e., at times, the mattress occludes almost 50% of the infant's body, the mobile feedback system and its support beams) or by other body parts (i.e., stacking one foot on top of the other, grabbing feet with hands).The foregoing issues along with the complexity of the infant's posture (e.g., crossing legs) can prevent the Vicon system from correctly identifying and tracking the keypoints.Missing or incorrect marker data were handled using interpolation and filtering.Following that, we extracted features from the keypoint coordinate sequences that were preprocessed.Although keypoint missing or estimation errors may exist in a specific sequence input { J it (x, y) i = 1, 2, . . ., 12 }, there are a relatively small number of frames with errors in estimation.The three times standard deviation technique, where σ is the standard deviation of the sequence { J it }, was utilized for error elimination (1).Here E J is the average of { J it }.
The trajectory of a joint's location is continuous, and its velocity fluctuates uniformly during movement.Cubic spline interpolation was employed to fill in missing points.Ignoring frames with missing data may lead the actual movement to deviate from the subsequent calculation of movement complexity.

Dataset annotation
A primary aim of this work is to classify infant behaviour across experimental contexts and to explore whether structural changes in infant movement detected by AI converge on coordinative and dynamical indicators of infant agency.Importantly, dynamic and coordinative indicators of agentive discovery suggest that most infants' Aha!' moments occurred after 90 seconds of infant~mobile coupling 20 .Therefore, we split the tethered phase across time to investigate whether the various classification architectures employed here could differentiate between infant initial exploratory movement during interaction (i.e, CR1 -the first minute of infant~mobile coupling) from possible intentional activity later in the tethered phase (i.e, CR2 -beginning at 120s into coupling) (see Figure 7).Movement sequences were divided into five stages and annotated as follows: B1 (spontaneous baseline, no mobile motion), B2 (uncoupled reactive baseline, experimenter-triggered mobile motion), CR1 (the first minute of the tethered stage), CR2 (~ 2 min.into the tethered stage), DC (decoupled, untethered stage).For each infant, the five stages (B1, B2, CR1, CR2, DC) are labelled as 0 to 4, respectively.In subjects 1, 3, and 5, the ribbon was connected to the left foot; in S2 and S4, the right foot was connected.Figure 8 depicts the available data for all five infants across all stages.Due to the nature of experiments with human babies, uneven stage lengths are inevitable (e.g., CR1 for S4 contains only 18.94s of data).

Spherical coordinates of movement directions
Vicon data represent the position of markers relative to an arbitrary origin of the 3D capture space.A map of body-centric displacements can be created by designating a root joint marker and calculating displacement of other joints relative to the root.Here, we transferred the Center Pelvis to the centre of our spherical coordinates (the root) and then calculated all joint displacement between each joint and the centre (0, 0, 0).Thus, all analyses were run on joint displacements calculated from body-centered positional 3D motion capture data.A modified spherical coordinate system was then used to partition 3D space into n bins.Spherical coordinates align with the infants' specific movement directions 1 .To elaborate, spherical coordinates are assigned to be view-invariant, meaning descriptors of the same type of infant position are similar even when collected from different perspectives (Fig. 8).To create a compact representation of infant positions, we chose six informative joints (L/R hand, L/R knee, and L/R foot).The 3D space is partitioned into n bins by reference vectors alpha and theta, horizontal from L Pelvis to R Pelvis and perpendicular to the coordinate centre; therefore, any 3D joint can be localised at a specific bin 51,52 .

Estimation of the poses using histogram-based features
Action recognition techniques based on histograms using a set of feature vectors extracted from the skeletal joint locations allow us to estimate 3D pose in lower dimensions relative to standard methods 25 suited for pose estimation from 2D video data (e.g., 53,54 ).Histogram-based features derived from baby joint displacement have been successfully employed in a deep learning architecture to classify infant movements 55 .Using this approach, the displacement of each joint is extracted for a 5s snippet, and then a feature called Histogram of Joint Displacement (HJD) is obtained.HJDs track which particular joint occupies each spherical coordinate and for how much time.These HJDs are used as inputs for the machine/deep learning systems tested here.The partitioning of the 3D space into n bins is referenced by vectors alpha and theta (right).

Classic machine learning approaches
Vol:.( 1234567890)  24 , providing a reasonable baseline for our evaluation performances.

Deep learning approaches
Recent studies have demonstrated that deep learning frameworks can be successfully applied to human action recognition 30 .However, several obstacles exist, mainly the large amount of data required to enable deep learning and the difficulty of creating explainable AI when applying deep learning to infants' activities and their healthcare domain 56,57 .Understanding how a framework makes decisions is critical in these domains.Yet, deep features in deep learning frameworks are often incomprehensible to humans.Hence, our deep learning frameworks employ hand-crafted pose-based feature sets to classify infant movements.The histogram vectors of the posebased features were evaluated using different deep learning architectures, including a Fully connected network (FCNet), 1D-Convolutional Neural Network (1D-Conv), 1D-Capsule Network (1D-CapsNet), 2D-Conv, and 2D-CapsNet.A range of hyperparameters for each architecture was optimized using a cross-validation approach.
The result section describes the model's parameters and other statistics such as the size of the input and output layers, kernels, and the number of filters.
In addition to the comprehensive exploration of deep learning frameworks and the challenges faced, we have provided detailed explanations and specifications in the Supplementary Information, demonstrating the complexities of our model's parameters, input and output layers, kernels, number of filters, and other relevant statistics.

Training and testing inputs: dealing with imbalanced data
Using five subjects with uniformly available joint data we employed a leave-one-out (i.e., one subject) validation approach.The leave-one-out approach involved training each model using four out of the five infants' HJDs and testing model accuracy using the infant HJD dataset which was left out from training.This procedure was repeated five times so that each infant dataset was used for testing once.Given the uneven quantities of available data for each experimental stage across infants (see Fig. 7), the data were down sampled for each instance of training to the shortest stage length among the four infants used for training.For example, when subjects 1-4 were used for training, the shortest stage for any of these four infants was CR1 for S4 (18.94s).Therefore, all stages for subjects 1-4 were downsampled to 18s, producing an equally balanced training set across infants and across stages.In our training process, we selected 1794 samples for each classification stage, with the quantity restricted by the count observed in stage CR1, which exhibited the lowest sample number.The remaining stages were randomly downsampled to ensure balanced labels.We obtained accuracy percentages for each instance of testing and averaged them (Table 1).In addition, to better understand the temporal evolution of infant movement, an overlapped window sliding strategy was employed that used a window width of 5 seconds with 1 second overlapped across all experimental stages.This allowed us to assess fluctuations in classification accuracy across time throughout the experiment (illustrated with small black windows and arrows in Fig. 7).Average classification accuracy on sliding windows is reported as a measure of model performance.Analysis of variance (ANOVA) is used to determine the significance of differences between the results of each method ( p < 0.05).The classification accuracy achieved for L/R foot individually and fused Feet was superior to that of all other feature sets.In particular, 2D-CapsNet achieved the highest accuracy of 86.25% for the fused Feet feature set.In terms of mean joint type accuracy, all foot features (i.e., L, R and fused) scored significantly higher average accuracy across all evaluated classification methods compared to other joint types (F(2, 48) = 34.65,p < 0.0005).
LDA and kNN achieved highest accuracy for single joint features.For example, LDA achieved the highest accuracy of 59.63% and 75.63% for the left hand and left foot, respectively, whereas kNN achieved 58.00% and 61.63% for the right hand and left knee, respectively.
However, when we considered fused features, deep learning approaches outperformed LDA and kNN.All fused features improved in accuracy using deep techniques.When spatial information between different body parts was used during movement, 2D deep approaches performed better: 2D-Conv and 2D-Capsnet achieved an accuracy of 59.57% and 86.25% for Hands and Feet, respectively.Even though 2D-CapsNet achieved a lower accuracy for Hands (50.65%), a comparison of the mean accuracy per classifier demonstrated that 2D-CapsNet maintained the highest mean accuracy at 65.65.Also, the fused feature of Full-body that represent the most general representation of 3D skeletal joint information achieved the highest accuracy (65.51%) using the 2D-CapsNet approach.In sum, though all models tested performed well above chance (20%), 2D-CapsNet models using fused feature inputs made the most accurate classifications.

Using the 2D-CapsNet to assess stage transitions
A sliding window was used to analyze each infant's movement in conjunction with the leave-one-out approach.For each infant, we shifted a window of 5s with 1s overlapping across one stage while keeping other windows stationary in the remaining stages.Figure 9a-e shows the moving average of the temporal analysis for each infant using the fused Feet feature.As can be observed, five ascending steps illustrate the correct labels for five distinct classes.Stage B2 received the most accurate labels, indicating that it is easier to detect Histograms of Joint Displacements (HJDs) accurately compared to other stages.This observation is echoed by the confusion matrix, where Stage B2 consistently received the most accurate labels (refer to Supplementary Information Figure 8).Additionally, the classifier did not perform well when it came to identifying spontaneous movement in stage B1, indicating that infants moved variably such that the system detected movements in B1 which were like movement patterns in multiple other stages.This becomes more evident when considering the moving average of classification accuracy for temporal analysis of fused features Knee, Hand, and Feet (Figure 10).B2 has a higher overall classification accuracy over the whole period than the preceding stage.Additionally, as we moved along after decoupling (in stage DC), we observed a decline in classification accuracy, indicating that movements were more irregular after the mobile stopped responding to infant movement.
In addition, a higher rate of misclassification between each stage and the stage immediately following was observed (see Figure 8 of the Supplementary Information), which is a common occurrence in classification tasks involving closely related categories.This can be attributed to the inherent similarities in features characterizing neighbouring stages.In the first baseline (B1), babies moved spontaneously in an undifferentiated manner, causing misclassification with later stage activity, notably B2.Many infants froze, reducing their activity when the experimenter triggered the mobile during the B2 (cf.Table 2, Figs.11, 12).Similarly, when infants were first tethered to the mobile (CR1), infants alternated between moving and freezing 20 , so confusion between B2 and CR1 is understandable.However, periods of freezing progressively shortened in length over the course of the first minute of tethering as infants acclimated to the mobile and their activity ramped up.Since CR1 and CR2 are samples derived from the same overarching experimental stage, it is reasonable to anticipate shared features in babies' movements during these stages.Finally, infants generally remained quite active after the tether was disconnected (stage DC).Classification accuracy was lowest in the DC stage.Given the proximity of CR2 to DC in the classification hierarchy, it is reasonable to expect that misclassifications would predominantly involve labels transitioning to the DC stage.In sum, the directionality of misclassification likely reflects this flow of the experimental design and the fact that these data represent the behaviour of naïve 3-month-old infants exploring and adjusting their behaviour within a novel and changing functional context.

Feature analysis
We first computed the average displacement rates of both feet across the stages to further understand the significance of higher classification accuracy levels observed during B2 and across CR (see previous section).As can be seen in Table 2, the feet were significantly more active during DC compared to the other stages (F(4, 16) = 3.52, p = 0.03, sphericity assumed), and infants tended to move their feet less during B2 compared to the other stages.Based on average foot movement rate alone, accurately classifying B2 and DC foot behaviour would appear to be easier task compared to classifying stages B1 and CR which had similar movement rates.
Infant learning during coupling has classically been defined as 150% increase in movement rate relative to spontaneous baseline 8 .While this cut-off was not met by these five babies as a group, the infants do increase movement rate roughly 150% during tethering relative to the second baseline when the experimenter moved the mobile.Given that both the coupled phase and second baseline both involve mobile movement, the second Vol:.(1234567890  baseline is a fair comparison point contextually speaking.Furthermore, averaging movement rate across babies does not address individual behaviour or discovery processes.Finally, it is possible that infants discover specific trajectories that are particularly suited to elicit mobile response.Upon discovering the relationship between foot movement and mobile motion, infants may adjust foot movement topology rather than (or in addition to) foot movement rate.These infants do in fact change their foot trajectories as is reflected by the high accuracy rates depicted in Fig. 11.
To explore topological changes in infant joint movement, we examined the actual trajectory of the connected feet for Subject 1 (left foot) and Subject 2 (right foot) (Figs.11, 12).As can be observed, random movements in stage B1 became more constrained in both subjects after the examiner activated the mobile (B2).Topologically, infants produced almost the same movement trajectories for stages CR1 and CR2.S2 maintained similar activity in DC compared to CR, indicating lasting effects of baby-to-world connection even after that connection is severed.
To investigate how changes to features of infant movement directly inform deep learning systems, we quantified the HJDs of all joints for S2 over 10s in the middle of each stage (bins = 64) (see Fig. 13).Note that whereas the x-axis of each histogram reflects the number of bins (or the range of 3D space) a joint occupies, the y-axis reflects the amount of time the joint occupied a particular bin.Therefore, a large HJD value constrained to a single bin reflects little joint motion or, at most, very topologically restricted motion.Although most joints during B1 are clearly separated to the left and right of the histogram graph space in accordance with their anatomical sides (e.g., left hand (blue) occupies bins on the left of the histogram), the right foot (black) is visible in many bins between bins 8-57, implying random movement of the right foot in stage B1.The right foot movement became more restricted throughout stage B2 (bins 48 and 56), supporting the observations made in the preceding section (Results, C).There was a noticeable increase in the range of this joint's position across the CR stage (activity was between bins 46-57 in CR1 and expanded between bins 32-57 in CR2).Although the foot was active in a larger volume in CR than B2, activity became constrained again in the DC stage, like B2. Importantly, this evolution of activity was only observed in the foot linked to the mobile (the right foot) and not in any other limb segments.

Comparison of AI approaches: which AI approach works best and why?
We aimed to use machine and deep learning approaches to identify infant behavioural changes in response to changes in functional context.Both machine and deep learning techniques accurately classified infant Pose-Based  features (HJDs) throughout the experiment.All techniques tested reached accuracies higher than chance (20%) for all joint types (Table 1), although deep techniques generally outperformed machine learning techniques for any given joint type (Table 1).In particular, 2D-CapsNet achieved the greatest accuracy level (accuracy for fused Feet= 86%).Thus, 2D-CapsNet maximised spatial information between different body parts during movement, demonstrating the architecture's high-performance ability to generate feature hierarchy.

Comparison of joint classification across AI approaches: which joint has the most distinctive patterns and why?
For every architecture tested, whether of machine or deep learning, some measure of the feet achieved the highest classification accuracy rates (~ 20% higher than the fused hands, knees or whole body), indicating that the feet had the most distinctive pattern changes across the various experimental stages in response to the mobile's changing behaviour and the infant's functional connection to the mobile.

Do dynamics of classification accuracy reflect infant discovery?
The classification accuracies of several joints were assessed continuously over time to investigate how infants adapt to functional context and whether accuracy dynamics indicate that infants transition from exploring their relationship with the mobile to intentionally directing mobile movement.Again, classification accuracy was strongest by far for the feet, the end effector, compared to other body parts regardless of experimental stage (Fig. 10).Accuracy of the feet fluctuated around 80% during the spontaneous baseline, a lower level than when the mobile moved either at the hand of the experimenter during the second baseline or at the foot of the infant during coupling.These results taken together with the large spread seen in the first baseline histogram of one infant (each of Subject 2's feet appear in 10 histogram bins, Fig. 13a) denote that infants' movements are more widespread topologically and therefore harder to classify when the infant is moving spontaneously without any active environmental stimulation.On the other hand, once the experimenter began triggering the mobile, classification accuracy jumped up, hovering around 90% throughout the second baseline.This spike in classification accuracy is related to the concentration of foot activity within 3D space during the second baseline (e.g., each of Subject 2's feet appear in only one or two histogram bins, Fig. 13b).Soon after infant~mobile coupling began, there was a momentary dip in accuracy (Fig. 10), possibly reflecting infants' initial probing of a variety of trajectory orientations.However, accuracy levels spiked in the second minute of coupling (hovering again around 90%), likely reflecting infant discovery of trajectories particularly suited for triggering mobile motion (e.g., the switch to Z-orientation seen in individual infants as they raised and lowered their foot to trigger mobile motion, Figs.11-12).These AI results converge on our dynamic and coordinative analyses which likewise indicate that  the transition from initial spontaneous exploration to goal-directed action during infant mobile coupling can be characterized by initial expansion and later minimization of movement variability 20 .In sum, classification accuracy fluctuates within and across experimental stages and seems to reflect processes of exploration and discovery.
However, what can we make of the fact that classification accuracy levels were similar when the experimenter triggered the mobile and during coupled interaction?Since histograms of joint displacement were used as inputs to the AI architectures tested here, stark changes in overall quantity of limb activity from stage to stage are likely to be relevant to stage classifications.Accurately classifying foot behaviour based on average foot movement rate alone seems to be more straightforward during the second baseline (when infants were least active compared to the spontaneous baseline and the coupled stage as the latter two stages had similar movement rates (Table 2)).Therefore, although similar classification accuracy rates were obtained for the feet in the second baseline and during coupling, the high classification accuracy results during the second baseline are largely explained by the overall reduction in movement rate.Simply put, it is easy to identify data from the second baseline because the babies move much less.In contrast, the high classification accuracy during coupling indicates that foot movement changes topologically during coupling (i.e., from the first minute to the second minute) and across the stages.Our AI architectures use those qualitative changes (not merely quantity of movement) to classify behaviour.Though infant contingency studies traditionally infer learning when movement quantity is much greater during coupling compared to spontaneous baseline, our results also demonstrate that qualitative changes in infant movement (Figs. 9, 10, 11, 12, 13) can be detected even when average movement rates do not differ significantly.At the same time, it confirms that a qualitative change has taken place within coupling and across the experiment.
After the connection between infant and mobile was severed, activity spiked (Table 2) and accuracy quickly declined, plunging even lower than spontaneous baseline (Fig. 10).Infants appear to switch from directed movement during coupling to varied, exploratory movement after decoupling.As a group, they explore to a greater degree after being disconnected than before they were exposed to the possibility of controlling the mobile.It is as if the infants' search for connection to the world has intensified because of the lost linkage.However, certain infants' trajectories in the decoupled stage reflect a lasting imprint of coupling (e.g., S2, Fig. 12).It is possible that only some infants recognize their functional relationship to the mobile and so only these infants maintain the patterns which emerged during coupling after decoupling, expecting those unique patterns to elicit the mobile response.If the accuracy rate during decoupling persists at the same high level observed during coupling or dips below the lower accuracy rate seen during spontaneous movement, either pattern could indicate infant discovery during coupling.Using classification accuracy to identify infants who discover their ability to make the mobile move is complicated by the fact that different behaviours could indicate infant discovery for different reasons.Identifying infant discovery likely requires piecing together clues from various experimental stages.Therefore, temporal analysis of classification accuracy for each infant across the entire experiment (Fig. e 9) provides an important tool for understanding individualized processes of exploration and discovery.It is important to recognize that regarding the science of goal-directed action, the infant researcher is at a major disadvantage in comparison to the adult researcher for the simple reason that adults can speak and understand verbal instructions.One can ask an adult to execute a specific task or ask the adult participant what they intended to do.Given that machine and deep learning architectures detect qualitative changes in (prelinguistic) infant movement related to functional context and goal-directedness, AI analysis offers new levels of insight into behavioural~cognitive processes of preverbal infants.

Limitations and future directions
As in any formal comparison of machine and deep learning techniques, future research may explore the effects of other hyperparameter settings on classification accuracy rates.Although the histogram-based techniques used here condensed 3D movement data to a lower dimensionality by grouping data into bins, a comprehensive representation of the data was still preserved.Also, movement was represented using feature descriptors that are intended to simplify and extract useful information.As a result, rather than performing a frame-by-frame comparison, we were able to examine the distribution of the displacement of all joints across a time window.Future studies may incorporate the actual recorded video and skeletal joint information involving more infants and employ Recurrent Neural Networks (RNN) and more advanced Transformer and attention-based models to investigate the feasibility of predicting or decoding the temporal information of the input posture sequence.In addition, other advanced pose-based features can be used to evaluate the proposed network architectures 58 .
The results demonstrate accurate classification at the granularity of experimental stages but, as we have applied a windowing approach (5s wide and 1s overlap) we have demonstrated that we can classify these windows with an averaged accuracy of ~86% across infants (transferability of classifier to new unseen participants).This demonstrates that the proposed AI framework could be applied online, in real-time for labelling within experiments and using the knowledge provided by the classifier to adapt the experimental paradigm in real-time to investigate more complex neurodevelopmental processes.
A notable limitation of our study is that the current analyses were restricted to five infants.Increasing the sample size would enhance our present results.Nevertheless, it's impressive that we can distinguish distinct changes in such a small sample.One possible path to recover the entire sample featured in the original experiment would be to use pose estimation on 2D videos to extract 3D positional data from the infants with incomplete motion capture datasets.The technology to accomplish this is in development.Also, a key goal for future work is directly connecting AI topological classifier dynamics (cf Figs.www.nature.com/scientificreports/Beyond unravelling the way infants discover to stabilize control over naturalistic interactions with the world, a basic issue of cognitive development (as well as artificial intelligence), applying AI to identify context-dependent changes in individual infant behaviour has obvious clinical relevance.Function~dysfunction is in many scenarios definitional to diagnosis.However, the General Movement Assessment, one of the most widely used tools for predicting and diagnosing motoric and cognitive disorders from spontaneous infant movement, does not address an infant's functional movement abilities 59,60 .Given this disconnect, it is, therefore, no wonder that many recent AI efforts to classify and predict clinical dysfunction from infant behaviour have only considered spontaneous infant movement 4,[61][62][63][64][65][66][67][68][69] or that datasets made publicly available to train and test machine learning algorithms for classifying normative and atypical behaviour related to disease/disorder is predominantly composed of spontaneous behaviour data [70][71][72] .The baby~mobile paradigm offers an ethological means to assess individual infant motor and cognitive behaviour in both spontaneous and functional contexts [73][74][75][76][77][78] and a potential means for personalized intervention 79,80 .We are developing a multi-pronged program of research built on synergizing theory, experiment, coordination dynamics modelling and analysis, active inference Bayesian modelling, and AI analysis aimed at understanding the emergence of sentient agency in lawful terms 81 and advancing individualized, functional diagnosis and treatment capabilities for at-risk infants.

Conclusion
Both machine and deep learning techniques successfully classified randomly sampled five-second snippets of 3D infant movement from different experimental stages using histogram features of pose-based information.The deep learning 2D-CapsNet, however, achieved the highest accuracy classification rate.Critically, for every architecture tested, whether of machine or deep learning, some measure of the feet achieved the highest classification accuracy rates.Without informing these AI systems about the experimental design or the connectivity of the foot to the mobile, every architecture tested speaks the same message: the feet, the end effectors, are most uniquely affected by the baby~mobile interaction.Put another way, the functional connection to the world influences the baby most where it matters, namely at the point of infant~world connection.The information flow between agent and world is palpable; it is bodily felt and enacted.In contrast to traditional infant contingency studies which rely upon changes in movement quantity to infer learning, our AI classifiers offer a means to automatically detect characteristic topological changes across the experimental stages.Thus, AI systems can play a significant theoretical role, namely by providing insight into the early ability of infants to actively detect and engage in a functional relationship with the environment.Additionally, assessing dynamics of AI classification accuracy for each infant opens a new avenue for unravelling when and how individuals engage with and discover their relationship to the world.Whereas previous AI approaches have focused on classifying spontaneous infant movement in relation to clinical outcomes, pairing theory-driven experimentation with AI tools will ultimately allow us to develop more robust, context-dependent and functionally relevant assessments of infant behaviour for risk, diagnosis and treatment of disorders.

Figure 1 .
Figure 1.(a) Infant marker layout.(b) Infant in crib with mobile above.A 3D reconstruction of the markers is overlaid on top of the video recording.All coloured lines represent the virtual skeleton of the 'baby-mobile object' being tracked.The coloured dots are the reflective markers.(c) A 3D representation of another infant without the video feed.

Figure 2 .
Figure 2. Schematic of the experimental paradigm.The experimental stages which manipulate the infant's functional context proceed from left to right.See the Experiment description.

Figure 3 .
Figure 3. Overview of the feature extraction and classification steps.

Figure 4 .
Figure 4. Different frames from one video camera viewpoint of subject 1. Reflective markers are easily visible on this infant's right leg.

Figure 5 .
Figure 5. 3D reconstructed tracking during four different stages in the experiment: (a) spontaneous baseline-no mobile movement (b) uncoupled reactive baseline -experimenter moves mobile (c) tethered stage (d) decoupled stage.

Figure 6 .
Figure 6.x, y, and z coordinates of body landmarks.

Figure 7 .
Figure 7.The top panel shows an idealized progression of experimental stages depicted in different colours (red, yellow, green, and blue represent spontaneous baseline (B1), uncoupled reactive baseline (B2), tethered stage (CR) and untethered stage (DC), respectively).The entire procedure was planned to span 12 minutes.To investigate whether various AI classification architectures could differentiate between initial infant exploratory movement during tethered interaction from possible intentional activity later in the tethered phase, the current project split the tethered phase into two segments surrounding the most typical timing for infant discovery as indicated by coordination dynamics analysis (i.e., ~ 90s into coupling)20 .These two segments were the first minute of infant~mobile coupling (CR1, shaded light green) and the third minute of coupling (CR2, shaded dark green).Outlined in black, we planned to classify movement using two minutes of data for each of B1, B2, CR and DC (with CR split into CR1 and CR2).The bottom panel illustrates the available data for each stage for each of the five infants (S1-S5) relative to the idealized plan.Sliding windows (window width = 5s; overlap = 1s) were used across all experimental stages to assess fluctuations in classification accuracy across time (indicated by small black windows drawn to scale and arrows).

Figure 8 .
Figure 8. Keypoint locations are recalculated in reference to the pelvis to produce person-centred data (left).The partitioning of the 3D space into n bins is referenced by vectors alpha and theta (right).

Figure 9 .
Figure 9. Temporal analysis for each subject.Five ascending steps illustrate the correct labels for five distinct classes: B1 (spontaneous baseline, no mobile motion), B2 (experimenter triggered mobile motion), CR1 (minute 1 of coupling), CR2 (2 minutes into coupling), and DC (decoupled, no mobile motion), respectively.Completely flat lines on each ascending step would indicate perfect labelling for each stage.100 sliding steps are equivalent to 1s.

Figure 13 .
Figure 13.The HJD for Full-body joints over 10s in the middle of each stage for S2.The x-axis is the number of the 64 bins used in the modified spherical coordinate system.The y-axis reflects the percentage of time the joint occupied a particular bin.Stages: B1 (spontaneous baseline), B2 (uncoupled reactive baseline), CR1 (the first minute of the tethered stage), CR2 (~ 2 min.into the tethered stage), DC (decoupled stage).
9, 10) to signatures of emergent agency based on fine temporal resolution coordination dynamics20 .It is an open question whether these AI techniques can provide further insight into the dynamics of agency emergence in individual infants.https://doi.org/10.1038/s41598-024-66312-6 training points surrounding the test point.LDA models the decision boundary between classes.It identifies a linear combination of features that best differentiates the classes by maximizing the ratio of between-class variance to within-class variance.When applied to new data, this linear combination of features serves as a linear classifier.LDA operates on the Gaussian Distribution of input variables and can be used for both binary and multi-class classification.Both approaches have been previously applied to infant pose-based features Reports | (2024) 14:15580 | https://doi.org/10.1038/s41598-024-66312-6www.nature.com/scientificreports/

Table 1 .
Performance of all models: average sliding window accuracy (%) with standard deviation.*Foreachjoint-type, the model with greatest classification accuracy is in bold.**For each model, the joint-type with greatest classification accuracy is highlighted in bold italic.Table1presents the average classification accuracy for the various machine and deep learning approaches tested.The approximate chance level for accurately classifying the five experimental stages used here is 20%.Table 1 also reports accuracy levels for four fused features.This set of fused features was created by concatenating the histogram features based on the number of joints available in the datasets, forming a long 1D vector.The four resultant fused joints were (1) Hands: fused features from L &R hand, (2) Knees: fused features from L &R knee, (3) Feet: fused features from L &R foot, and (4) Full-body: fused features from L &R hand, L &R knee, and L &R foot.

Table 2 .
Average Displacement Rate of Feet (m/min).